Abstract

We consider the problem of training a binary sequential classifier under an error rate constraint. It is well known that for known densities, accumulating the likelihood ratio statistics is time optimal under a fixed error rate constraint. For the case of unknown densities, we formulate the learning for sequential detection problem as a constrained density ratio estimation problem. Specifically, we show that the problem can be posed as a convex optimization problem using a Reproducing Kernel Hilbert Space representation for the log-density ratio function. The proposed binary sequential classifier is tested on a synthetic data set and the UC Irvine human activity recognition data set, together with previous approaches for density ratio estimation. Our empirical results show that the classifier trained through the proposed technique achieves a smaller average sampling cost than previous classifiers proposed in the literature for the same error rate.


1 Introduction 

Sequential decision strategies outperform their fixed sample size counterparts by achieving the same decision risk with fewer samples on average. Initially developed by Wald [1] to reduce the number of inspections in industrial quality control, sequential testing has become widely used in clinical studies to reduce the average number of patients undergoing potentially risky treatments. Even when the cost of samples is not a major concern, sequential techniques can be used to reduce the computational cost of obtaining relevant information from a data sample. Thus the sequential test remains a method of great potential in any time-sensitive scenario. For example, in many computer vision problems, more sophisticated features are usually expensive and slow to obtain even though they provide higher accuracy. Therefore cascaded classifiers such as Viola-Jones [2] are widely used due to their sequential nature.

For the case of known class conditional densities, accumulating likelihood ratio statistics and comparing them with fixed thresholds minimizes the average stopping time under fixed error constraints. In this paper, we consider the case where the class conditional densities generating the data are unknown and the sequential decision rule has to be learned directly from labeled data samples. While there exists a plethora of supervised learning algorithms to learn fixed sample test rules using parametric and non-parametric forms, there exist relatively few algorithms designed to learn to perform sequential classification. Unlike single sample classification problems, where only the decision boundary is critical, sequential decision rules require a mapping from the sample space to a state space for aggregating evidence and defining stopping rules. To be concrete, we focus on the specific problem of learning a binary sequential classifier. The class conditional distributions are assumed to be unknown, but identical and conditionally independent over time, resulting in a stationary decision/aggregation rule. For temporal aggregation of information across samples, constructing an estimate of the likelihood or posterior probability emerges as an obvious framework for constructing a sequential rule.

The information summarizing problem itself has been discussed in [3] and the references therein without considering the sequential testing scenario. In the same framework as this paper, Sochman and Matas [1] constructed a likelihood ratio function estimator using AdaBoost [3 El to perform binary sequential classification based on accumulation and thresholding of the likelihood ratio estimate, resulting in an algorithm called Wald-Boost. Similarly, other methods of constructing density ratio estimates based on maximizing information theoretic functionals [3 [5] can be employed to perform sequential decisions.

However, the optimization criteria used by these methods for constructing likelihood ratio function estimates are not directly related to the performance in sequential detection. We note that errors in the likelihood estimate affect the average stopping time and error probabilities in a non-trivial way due to the accumulation of errors across samples. Kuh et al. pun] used reinforcement learning methods to propagate errors in terminal decisions to adjust weights in a parametric likelihood ratio function estimate to learn binary sequential classifiers. However, again the stopping time is not considered as a direct optimization criterion. In this paper we derive a variational bound on the sampling cost of SPRT and an associated non-parametric log-density ratio estimate which minimizes this bound. Our empirical results show that the sequential classifier trained through the proposed technique achieves a smaller average sampling cost than learned sequential tests employing likelihood function estimates proposed in the literature.


2 Problem Statement 

In this paper, the problem of learning a binary sequential detector from training data is studied. The training data consists of M samples $\{x_1^{(0)}, x_2^{(0)}, \ldots, x_M^{(0)}\}$ from class 0, and N samples $\{x_1^{(1)}, x_2^{(1)}, \ldots, x_N^{(1)}\}$ from class 1, sampled i.i.d. with unknown densities $p_0(x)$ and $p_1(x)$ respectively. Each sample $x_n^{(c)} \in \mathbb{R}^d$ is a d dimensional feature vector from class c. The learning problem is to design a sequential decision making mechanism which consists of an information aggregating rule, a stopping criterion and a decision rule to make a terminal decision between the two hypotheses $\{H_0, H_1\}$ regarding the density used to generate a series of samples in a test set. Here, the information aggregating rule is assumed to be stationary and to use only a one dimensional state variable summarizing the information received up to the current sample regarding the prevailing class label. Recall that in the classic setting for sequential detection with known class conditional densities, the Sequential Probability Ratio Test (SPRT) minimizes the stopping time for both classes under constraints on the miss detection and false alarm probabilities [1]. SPRT compares the cumulative log-likelihood ratio with fixed thresholds to choose between a terminal decision and continuing to sample:


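Stop and declare $H_0$ if:

$$\sum_{i=1}^{n} \log \frac{p_1(x_i)}{p_0(x_i)} \le a(P_F, P_M)$$

Stop and declare $H_1$ if:

$$\sum_{i=1}^{n} \log \frac{p_1(x_i)}{p_0(x_i)} \ge b(P_F, P_M)$$

Continue sampling if:

$$a(P_F, P_M) < \sum_{i=1}^{n} \log \frac{p_1(x_i)}{p_0(x_i)} < b(P_F, P_M)$$

where a and b are respectively the lower and upper terminating boundaries. Under the zero-overshoot assumption on the accumulated likelihood at the stopping time, the expected sampling cost for the standard binary SPRT is given in [1] as:

$$N_0 = \frac{1}{D_{01}}\left[P_F \log\frac{P_F}{1-P_M} + (1-P_F)\log\frac{1-P_F}{P_M}\right], \qquad N_1 = \frac{1}{D_{10}}\left[P_M \log\frac{P_M}{1-P_F} + (1-P_M)\log\frac{1-P_M}{P_F}\right] \tag{1}$$

where $D_{01} = \int p_0 \log(p_0/p_1)$ and $D_{10} = \int p_1 \log(p_1/p_0)$ are the Kullback-Leibler divergences between the two class conditional densities.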
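The stopping rule and the cost formula in (1) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `sprt` and `expected_samples` are hypothetical, the caller supplies the per-sample log-likelihood ratio and the thresholds a and b, and `d01`, `d10` stand for the two Kullback-Leibler divergences appearing in (1).

```python
import math

def sprt(samples, log_ratio, a, b):
    """Accumulate log p1(x)/p0(x) and stop at the first boundary crossing.

    Returns (decision, n_used); decision is None if the samples run out
    before either boundary is reached.
    """
    total, n = 0.0, 0
    for n, x in enumerate(samples, start=1):
        total += log_ratio(x)
        if total <= a:      # lower boundary: declare H0
            return 0, n
        if total >= b:      # upper boundary: declare H1
            return 1, n
    return None, n

def expected_samples(p_f, p_m, d01, d10):
    """Expected sample counts (N0, N1) of equation (1), zero overshoot assumed."""
    n0 = (p_f * math.log(p_f / (1.0 - p_m))
          + (1.0 - p_f) * math.log((1.0 - p_f) / p_m)) / d01
    n1 = (p_m * math.log(p_m / (1.0 - p_f))
          + (1.0 - p_m) * math.log((1.0 - p_m) / p_f)) / d10
    return n0, n1
```

For two unit-variance Gaussians with means 0 and 1 (a toy choice), the log-likelihood ratio is `x - 0.5` and both divergences equal 0.5, so `expected_samples(0.05, 0.05, 0.5, 0.5)` evaluates (1) for 5% error rates.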


Inspired by the structure of the SPRT, an appealing choice for learning a binary sequential detector is to construct a function estimate for the likelihood ratio function from training samples and to design the termination and decision rules as threshold comparisons, as in SPRT. Estimating the class conditional density functions independently and computing their ratio results in poor performance [8], since fit errors in different regions of the sample space are emphasized when dividing the two densities. A number of techniques have been suggested in the literature for directly estimating density ratio functions, which can be grouped into three classes: parametric approaches [12] that assume a parametric form and use regression methods to fit it through maximization of the binomial likelihood on the training data, boosting based methods [3] that rely on asymptotic properties of a weighted sum of weak learners, and non-parametric techniques [7], [8] that construct likelihood ratio function estimates through maximization of information theoretic divergence metrics on the training data.

An estimate of the log-posterior ratio can be converted to a density ratio estimate by canceling the term produced by the prior ratio. For example, a common parametric form for the log-posterior ratio is an additive function of the training samples:

$$\log \frac{P(H_1 \mid x)}{P(H_0 \mid x)} = F(x;\gamma) \tag{2}$$

then the estimate for the density ratio can be formed as:

$$\hat{r}(x) = e^{F(x;\gamma)}\,\frac{P(H_0)}{P(H_1)} \tag{3}$$


The parameter vectors $\{\gamma_m\}_m$ are typically estimated through maximization of the binomial log-likelihood function. When the model is specified correctly, the solution has the asymptotic optimality of a maximum likelihood estimator. Interestingly, as shown by Friedman et al. [13], boosting approaches that combine binary decisions of weak classifiers to train classifiers with improved performance can be analyzed under the same framework of fitting an additive model through maximization of likelihood. Specifically, consider a weighted sum of binary decisions from weak classifier outputs $\{f_i(x)\}$ with associated weights $\{c_i\}$:

$$F_a(x) = \sum_i c_i f_i(x) \tag{4}$$

The function $F_a(x)$ represents the aggregate decision of the ensemble of weak classifiers. The design process is to iteratively add new classifiers to the existing ones while optimizing the weights associated with them. In boosting, each new weak classifier is tasked to minimize a weighted classification error on the training set, in which higher weights are assigned to the samples that the current classifier classifies incorrectly. Friedman et al. [13] have shown that the iterative weighted minimization procedure is equivalent to minimizing the expected exponential error $E[e^{-yF_a(x)}]$, where $y \in \{-1, +1\}$ is the class label, which is a second order approximation to the binomial log-likelihood function. They also pointed out that the density ratio can be retrieved from the final classifier through:

$$\hat{r}(x) = e^{2F_a(x)}\,\frac{P(H_0)}{P(H_1)} \tag{5}$$
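The prior correction in (3) and (5) can be checked numerically. The sketch below is illustrative only: it uses two hypothetical unit-variance Gaussian classes so that the exact log-posterior ratio is available in closed form, and the helper names are not taken from the paper. The flag `half_scale` switches between the score convention of (3) and the boosting convention of (5), where the score is half the log-posterior ratio.

```python
import math

def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def log_posterior_ratio(x, prior1):
    """Exact log P(H1|x)/P(H0|x) for toy classes N(0,1) and N(1,1)."""
    p0 = gaussian_pdf(x, 0.0, 1.0)
    p1 = gaussian_pdf(x, 1.0, 1.0)
    return math.log(prior1 * p1) - math.log((1.0 - prior1) * p0)

def density_ratio_from_score(score, prior1, half_scale=False):
    """Equation (3); with half_scale=True the score is doubled as in (5)."""
    factor = 2.0 if half_scale else 1.0
    # Multiply by P(H0)/P(H1) to cancel the prior ratio inside the posterior.
    return math.exp(factor * score) * (1.0 - prior1) / prior1
```

With unbalanced priors, for example `prior1 = 0.8`, the corrected value agrees with the true ratio $p_1(x)/p_0(x) = e^{x - 0.5}$ of this toy pair, which is the behavior (3) is meant to guarantee.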


The boosting approach is resistant to over-fitting and provides a fast way of constructing density ratio estimates with good empirical performance.

In principle all these methods provide density ratio function estimates that can be incorporated into the SPRT structure to form a binary sequential classifier. However, the optimization criteria employed in these techniques are decoupled from the performance of these function estimates in a sequential decision task. This mismatch results in sub-optimal performance, as illustrated by our empirical experiments in Section 4.

In a novel direction, Nguyen et al. [7] derived variational characterizations of f-divergences which enabled estimation of divergence functionals and likelihood ratios through convex risk minimization. Following this work, we derive a variational bound on the expected sampling cost of SPRT (with known densities) and obtain an associated density ratio estimate $\hat{r}(x)$. Next, using a Reproducing Kernel Hilbert Space representation for the log-density ratio function, we obtain a convex optimization approach for fitting the density ratio estimate $\hat{r}(x)$ to training data, dubbed Wald Kernel Density Ratio Fit (WKDRF).


3 Wald Kernel Density Ratio Fit (WKDRF) 

In this section, we formulate a new algorithm for learning log-density ratio function estimates that are tailored for performing sequential binary detection in the SPRT decision structure of accumulation and thresholding. Our goal is to form a log-density ratio estimate such that the resulting sequential decision structure minimizes the average stopping time (or equivalently the expected number of samples) for a desired level of probability of error. Towards that end, we first extend known results on SPRT error probabilities [1] to the case of a learned sequential test based on a given density ratio estimate $\hat{r}(x)$.

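Theorem 1. In a learned SPRT, if the estimated density ratio function $\hat{r}(\cdot)$ is not constant and normalized as:

$$E[\hat{r} \mid H_0] = 1 \quad \text{and} \quad E[\hat{r}^{-1} \mid H_1] = 1 \tag{6}$$

then for fixed lower and upper thresholds a and b on the log-likelihood ratio, the probabilities of false alarm and miss detection of the terminal decisions are given by:

$$P_F = \frac{1 - e^{a}}{e^{b} - e^{a}} \quad \text{and} \quad P_M = \frac{e^{a}(e^{b} - 1)}{e^{b} - e^{a}} \tag{7}$$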
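The relations in (7) are easy to evaluate and to invert. The following sketch uses hypothetical function names and assumes a < 0 < b. The inverse map is obtained by solving (7) for $e^a$ and $e^b$, which gives $e^a = P_M/(1-P_F)$ and $e^b = (1-P_M)/P_F$.

```python
import math

def error_rates_from_thresholds(a, b):
    """Equation (7): false alarm and miss probabilities for thresholds a < 0 < b."""
    A, B = math.exp(a), math.exp(b)
    p_f = (1.0 - A) / (B - A)
    p_m = A * (B - 1.0) / (B - A)
    return p_f, p_m

def thresholds_from_error_rates(p_f, p_m):
    """Inverse of equation (7): thresholds that yield the requested error rates."""
    a = math.log(p_m / (1.0 - p_f))
    b = math.log((1.0 - p_m) / p_f)
    return a, b
```

For example, `thresholds_from_error_rates(0.01, 0.05)` returns the pair (a, b) that a learned SPRT satisfying (6) would use to target a 1% false alarm rate and a 5% miss rate.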


We note that the learned SPRT with a normalized density ratio estimate matches the error performance of SPRT, albeit with a potentially longer average stopping time. To prove Theorem 1 the following lemma is required.

Lemma 1. Let $Z_i = \log \hat{r}(x_i)$, $\Lambda_n = \sum_{k=1}^{n} Z_k$ and $G(u) = E[e^{uZ}]$. Under both hypotheses the following process is a martingale:

$$M_n = \frac{e^{u\Lambda_n}}{G(u)^n} \tag{8}$$





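Proof. First of all, it is easy to check that $M_0 = 1$. Next, one can verify that:

$$\begin{aligned}
E[M_{n+1} \mid M_k, 0 \le k \le n] &= E\left[\frac{e^{u\Lambda_{n+1}}}{G(u)^{n+1}} \,\middle|\, M_k, 0 \le k \le n\right] \\
&= E\left[\frac{e^{uZ_{n+1}}}{G(u)} \cdot M_n \,\middle|\, M_k, 0 \le k \le n\right] \\
&= M_n \frac{E[e^{uZ_{n+1}}]}{G(u)} \\
&= M_n
\end{aligned}$$

And since the $M_n$ are all positive valued, we have:

$$E[|M_n|] = E[M_n] = E[M_0] = 1$$

Thus the process satisfies the two properties required of a martingale.

□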
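Lemma 1 implies that $E[M_n] = 1$ for every n, which can be checked by simulation. The sketch below is only an illustration: it assumes, hypothetically, that the increments $Z_i$ are Gaussian, so that $G(u) = \exp(u\mu + u^2\sigma^2/2)$ is available in closed form; the function name and default parameters are arbitrary choices.

```python
import math
import random

def martingale_mean(u, n_steps, n_paths, mean=-0.1, std=0.5, seed=0):
    """Monte Carlo estimate of E[M_n] with Z_i ~ N(mean, std^2) (toy choice)."""
    rng = random.Random(seed)
    log_g = u * mean + 0.5 * (u * std) ** 2   # log G(u) for a Gaussian Z
    total = 0.0
    for _ in range(n_paths):
        lam = sum(rng.gauss(mean, std) for _ in range(n_steps))  # Lambda_n
        total += math.exp(u * lam - n_steps * log_g)             # M_n
    return total / n_paths
```

The estimate returned by `martingale_mean(0.5, 5, 20000)` should be close to 1 up to Monte Carlo error, regardless of the drift of $Z_i$.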


Next we prove Theorem 1.

Proof. Define $G_0(u) = E[e^{uZ} \mid H_0]$ and $G_1(u) = E[e^{uZ} \mid H_1]$. The special case of $\hat{r} = 1$ satisfies both constraints, but when $\hat{r} = 1$ the test never stops. Other than that, no constant valued $\hat{r}(\cdot)$ satisfies both constraints. When $\hat{r}$ is not constantly 1, the test will stop at a finite time. Let N be the random stopping time; then the two types of error when the test stops are:

$$P_F = \Pr\{\Lambda_N \ge b \mid H_0\} \quad \text{and} \quad P_M = \Pr\{\Lambda_N \le a \mid H_1\}$$

For the special cases of u = 0, u = -1 and u = 1:

$$G_0(0) = 1, \quad G_1(0) = 1$$

and

$$G_0(1) = \int \hat{r}(x)\,p_0(x)\,dx, \quad G_1(-1) = \int \hat{r}(x)^{-1}\,p_1(x)\,dx$$

Now one can evaluate the expected value of the martingale $M_n$ at the stopping time N under $H_0$ with u = 1, which gives:

$$\begin{aligned}
E\left[\frac{e^{\Lambda_N}}{G_0(1)^N} \,\middle|\, H_0\right] &= 1 \\
E[e^{\Lambda_N} \mid H_0] &= 1 && \text{when } E[\hat{r} \mid H_0] = 1 \\
P_F \cdot B + (1 - P_F)\,A &= 1 && \text{applying the zero-overshoot assumption}
\end{aligned}$$

where $A = e^{a}$ and $B = e^{b}$, so that $P_F = (1 - A)/(B - A)$. Similarly, by constraining $E[\hat{r}^{-1} \mid H_1] = 1$, one can get:

$$P_M = \frac{A(B - 1)}{B - A}$$

□
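The result can be verified in a setting where the zero-overshoot assumption holds exactly. In the hypothetical example below, the estimate $\hat{r}$ takes only the two values $e^{\delta}$ and $e^{-\delta}$, with probabilities chosen so that both normalizations in (6) hold, and the thresholds are $a = -k\delta$ and $b = k\delta$. The accumulated statistic then lands exactly on a boundary, and the empirical error rate can be compared with (7). The function name and parameters are illustrative assumptions.

```python
import math
import random

def simulate_binary_sprt(delta, k, hypothesis, n_trials, seed=0):
    """Error rate of a learned SPRT whose log-ratio steps are +delta or -delta."""
    rng = random.Random(seed)
    q_up = 1.0 / (1.0 + math.exp(delta))   # P(Z = +delta | H0), gives E[r | H0] = 1
    if hypothesis == 1:
        q_up = 1.0 - q_up                   # P(Z = +delta | H1), gives E[1/r | H1] = 1
    errors = 0
    for _ in range(n_trials):
        steps = 0                           # accumulated sum in units of delta
        while -k < steps < k:
            steps += 1 if rng.random() < q_up else -1
        declared = 1 if steps >= k else 0
        errors += declared != hypothesis
    return errors / n_trials
```

With `delta = 0.5` and `k = 4` the thresholds are a = -2 and b = 2, for which (7) gives $P_F = P_M = 1/(1 + e^{2})$; the simulated rates under either hypothesis should agree with this value up to sampling noise.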


Next, following the approach used in [7] to develop estimates of divergence functionals and likelihood ratio functionals, we derive a variational upper bound on the sampling cost of SPRT, which reveals a density ratio function estimate linked to the sequential test performance.


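Theorem 2. The average stopping time for SPRT can be upper bounded by the solution of the following problem:

$$\min_{\hat{r}} \; \frac{w_0}{-\int p_0 \log(\hat{r})} + \frac{w_1}{\int p_1 \log(\hat{r})} \quad \text{s.t.} \quad \int p_0\,\hat{r} = 1, \;\; \int p_1\,\hat{r}^{-1} = 1 \tag{9}$$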


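Proof. Recall that the expected number of samples in the standard SPRT is given in (1). Since the terms inside the brackets are constant after fixing the error rates, we define the following two constants for simplicity:

$$w_0 = \pi_0\left[P_F \log\frac{P_F}{1-P_M} + (1-P_F)\log\frac{1-P_F}{P_M}\right]$$

and

$$w_1 = \pi_1\left[P_M \log\frac{P_M}{1-P_F} + (1-P_M)\log\frac{1-P_M}{P_F}\right]$$

where $\pi_0$ and $\pi_1$ are the prior probabilities of $H_0$ and $H_1$. With $r = p_1/p_0$ denoting the true density ratio, the standard SPRT sampling cost can then be written as:

$$C = \frac{w_0}{D_{01}} + \frac{w_1}{D_{10}} = \frac{w_0}{-\int p_0 \log r} + \frac{w_1}{-\int p_1 \log r^{-1}} \tag{10}$$

Applying a similar method as [7], the cost objective can be upper bounded using the convex conjugate formula for the $-\log(\cdot)$ function, which is:

$$-\log(x) = \sup_{x^* < 0}\left(x\,x^* + 1 + \log(-x^*)\right) \tag{11}$$

as:

$$\begin{aligned}
C &= \frac{w_0}{-\int p_0 \log r} + \frac{w_1}{-\int p_1 \log r^{-1}} && (12) \\
&\le \frac{w_0}{\sup_g \int p_0\,(g \cdot r + \log(-g) + 1)} + \frac{w_1}{\sup_f \int p_1\,(f \cdot r^{-1} + \log(-f) + 1)} && (13) \\
&\le \inf_{f,g} \; \frac{w_0}{\int g\,p_1 + p_0 \log(-g) + p_0} + \frac{w_1}{\int f\,p_0 + p_1 \log(-f) + p_1} && (14)
\end{aligned}$$

Using the two constraints (6) and defining the density ratio estimate as $\hat{r} = -f = -1/g$, the first denominator reduces to $-\int p_0 \log \hat{r}$ and the second to $\int p_1 \log \hat{r}$, and we obtain the variational bound given in (9).

□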
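The two ingredients of the proof, the conjugate representation (11) and the resulting lower bound on a Kullback-Leibler divergence, can be checked on a small discrete example. The sketch below is an illustration with hypothetical helper names. It uses the unconstrained form of the bound, $D_{01} \ge 1 - \sum p_1/\hat{r} - \sum p_0 \log \hat{r}$, which follows from (11) with $g = -1/\hat{r}$ and holds for any positive $\hat{r}$, with equality at $\hat{r} = p_1/p_0$.

```python
import math
import random

def conjugate_value(x, y):
    """Inner expression of (11): x*y + 1 + log(-y), defined for y < 0."""
    return x * y + 1.0 + math.log(-y)

def kl(p, q):
    """Kullback-Leibler divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def lower_bound_d01(p0, p1, r):
    """1 - sum(p1 / r) - sum(p0 * log r), a lower bound on KL(p0 || p1)."""
    inverse_term = sum(b / ri for b, ri in zip(p1, r))
    log_term = sum(a * math.log(ri) for a, ri in zip(p0, r))
    return 1.0 - inverse_term - log_term
```

Evaluating `conjugate_value(x, -1.0 / x)` reproduces `-math.log(x)`, and any other negative `y` gives a smaller value, which is why replacing the supremum by a particular function can only decrease the denominators and increase the cost bound.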


3.1 Kernel Based Density Ratio Fitting 

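If we adopt a Reproducing Kernel Hilbert Space representation for the log-density ratio function and replace the class conditional densities with the empirical distributions defined by the training data, the variational problem given in (9) is equivalent to:

$$\min_{\hat{r}} \; \frac{w_0}{-\frac{1}{M}\sum_{i=1}^{M} \log \hat{r}(x_i^{(0)})} + \frac{w_1}{\frac{1}{N}\sum_{j=1}^{N} \log \hat{r}(x_j^{(1)})} \quad \text{s.t.} \quad \frac{1}{M}\sum_{i=1}^{M} \hat{r}(x_i^{(0)}) = 1, \;\; \frac{1}{N}\sum_{j=1}^{N} \hat{r}(x_j^{(1)})^{-1} = 1 \tag{15}$$

Next, we impose the Reproducing Kernel Hilbert Space (RKHS) structure on the log-density ratio function. Any function in the RKHS can be written in the inner product form:

$$f(\cdot) = \langle w, \Phi(\cdot, u) \rangle = \sum_{c=1}^{C} \alpha_c K(\cdot, u_c)$$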

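where $K(\cdot,\cdot)$ is the kernel function. In this paper, we choose the Gaussian kernel with width $\sigma$ and randomly sampled centers as suggested in [5]. Since the objective function is a pointwise cost whose minimizer is not unique and could even be infinite dimensional, we add a regularization term to penalize the $\ell_2$ norm of the estimated log-likelihood ratio function, which gives:

$$\begin{aligned}
\min_{\alpha} \quad & \frac{w_0}{-\frac{1}{M}\sum_{i=1}^{M}\sum_{c=1}^{C} \alpha_c \exp\left(-\frac{\|x_i^{(0)} - u_c\|^2}{2\sigma^2}\right)} + \frac{w_1}{\frac{1}{N}\sum_{j=1}^{N}\sum_{c=1}^{C} \alpha_c \exp\left(-\frac{\|x_j^{(1)} - u_c\|^2}{2\sigma^2}\right)} + \lambda\,\alpha^{T} K \alpha \\
\text{s.t.} \quad & \frac{1}{M}\sum_{i=1}^{M} \exp\left(\sum_{c=1}^{C} \alpha_c \exp\left(-\frac{\|x_i^{(0)} - u_c\|^2}{2\sigma^2}\right)\right) = 1, \quad \frac{1}{N}\sum_{j=1}^{N} \exp\left(-\sum_{c=1}^{C} \alpha_c \exp\left(-\frac{\|x_j^{(1)} - u_c\|^2}{2\sigma^2}\right)\right) = 1
\end{aligned} \tag{16}$$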


The equality constraints can be relaxed to inequality constraints to obtain a convex optimization problem:

$$\begin{aligned}
\min_{\alpha} \quad & \frac{w_0}{-\frac{1}{M}\sum_{i=1}^{M}\sum_{c=1}^{C} \alpha_c \exp\left(-\frac{\|x_i^{(0)} - u_c\|^2}{2\sigma^2}\right)} + \frac{w_1}{\frac{1}{N}\sum_{j=1}^{N}\sum_{c=1}^{C} \alpha_c \exp\left(-\frac{\|x_j^{(1)} - u_c\|^2}{2\sigma^2}\right)} + \lambda\,\alpha^{T} K \alpha \\
\text{s.t.} \quad & \frac{1}{M}\sum_{i=1}^{M} \exp\left(\sum_{c=1}^{C} \alpha_c \exp\left(-\frac{\|x_i^{(0)} - u_c\|^2}{2\sigma^2}\right)\right) \le 1, \quad \frac{1}{N}\sum_{j=1}^{N} \exp\left(-\sum_{c=1}^{C} \alpha_c \exp\left(-\frac{\|x_j^{(1)} - u_c\|^2}{2\sigma^2}\right)\right) \le 1
\end{aligned} \tag{17}$$

where K is the kernel matrix with $K(i,j) = \exp\left(-\frac{\|u_i - u_j\|^2}{2\sigma^2}\right)$. One may observe that since the two denominator terms are both linear functions of $\alpha$, the convexity preserving rule for composition of functions guarantees that the objective is convex in $\alpha$ as long as $\alpha$ is properly initialized. Specifically, $1/x$ is a convex non-increasing function for a positive valued denominator and the linear function is concave, resulting in the composite function being convex when the denominator is positive. We therefore need to guarantee that both denominators in (17) are positive, that is, that the average of the log-density ratio estimate is negative over the class 0 samples and positive over the class 1 samples. This can be easily done by performing a proper initialization. Since the exponential function coefficients can be viewed as the normals of hyperplanes in terms of $\alpha$, in (17) we need to choose the $\alpha$ vector such that it lies in the region that gives the proper inner product value for both terms. A natural yet simple choice of the initial $\alpha$ is the normalized equipartitioning (bisecting) vector of the two normal vectors. The resulting parameter vector $\alpha$ defines the estimator of the log-density ratio function, which summarizes each observation into a log-likelihood ratio to be used in a learned SPRT. The testing phase is exactly the same as the standard SPRT with known densities, except that in the learned test the estimated density ratio function is used as the information aggregation mapping. The resulting learned SPRT automatically satisfies the error constraints with appropriately chosen thresholds, as shown in Theorem 1.
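The pieces of the relaxed problem (17) can be sketched in plain Python. This is a hedged illustration, not the authors' code: it only evaluates the objective and the two constraint functions and builds the bisector initialization described above, leaving the actual minimization to any constrained convex solver. All function names are hypothetical, samples and centers are tuples of floats, and the sign convention assumes the fitted function estimates $\log(p_1/p_0)$, so the two normals are taken as the negated class 0 mean feature vector and the class 1 mean feature vector.

```python
import math

def gaussian_features(x, centers, width):
    """Kernel values K(x, u_c) for every center u_c."""
    return [math.exp(-sum((xi - ui) ** 2 for xi, ui in zip(x, u)) / (2.0 * width ** 2))
            for u in centers]

def mean_features(data, centers, width):
    """Average feature vector of a class; the denominators are linear in alpha."""
    rows = [gaussian_features(x, centers, width) for x in data]
    return [sum(col) / len(rows) for col in zip(*rows)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def bisector_init(phi0, phi1):
    """Unit vector halfway between the normals -phi0 and +phi1."""
    n0, n1 = math.sqrt(dot(phi0, phi0)), math.sqrt(dot(phi1, phi1))
    v = [b / n1 - a / n0 for a, b in zip(phi0, phi1)]
    norm = math.sqrt(dot(v, v))
    return [vi / norm for vi in v]

def objective(alpha, phi0, phi1, w0, w1, lam, K):
    """Objective of (17); valid while dot(alpha, phi0) < 0 < dot(alpha, phi1)."""
    m = len(alpha)
    reg = sum(alpha[i] * K[i][j] * alpha[j] for i in range(m) for j in range(m))
    return w0 / (-dot(alpha, phi0)) + w1 / dot(alpha, phi1) + lam * reg

def constraints(alpha, data0, data1, centers, width):
    """Left-hand sides of the two relaxed constraints; both should be <= 1."""
    g0 = sum(math.exp(dot(alpha, gaussian_features(x, centers, width)))
             for x in data0) / len(data0)
    g1 = sum(math.exp(-dot(alpha, gaussian_features(x, centers, width)))
             for x in data1) / len(data1)
    return g0, g1

def shrink_to_feasible(alpha, data0, data1, centers, width, max_halvings=60):
    """Halve alpha until both relaxed constraints hold (keeps the signs)."""
    for _ in range(max_halvings):
        g0, g1 = constraints(alpha, data0, data1, centers, width)
        if g0 <= 1.0 and g1 <= 1.0:
            break
        alpha = [0.5 * a for a in alpha]
    return alpha
```

By the Cauchy-Schwarz inequality the bisector gives a negative class 0 average and a positive class 1 average whenever the two mean feature vectors are not parallel, so the objective is finite at the starting point. Scaling the vector down preserves these signs, which is why `shrink_to_feasible` can be used to obtain a feasible starting point before handing the problem to a solver.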


4 Experimental Results 

We compare the performance of the learned SPRT using WKDRF with the performance of Wald-Boost [1], which is based on AdaBoost training of the density ratio function, and the learned SPRT employing the KL-divergence density ratio fit, which fits the density ratio by maximizing a lower bound to the one sided KL-divergence. The kernel width in our method and in the KL-divergence fitting method is chosen using cross validation.




Figure 1: Synthetic example. (a) Training data. (b) Performance curve.


We first tested the algorithm on a synthetic data set. In this example, $H_0$ samples are a single component Gaussian random vector [...], and $H_1$ samples are a Gaussian mixture of the form $\frac{1}{2}\mathcal{N}(\ldots)$ [...] in testing. For simplicity, we choose the termination boundaries to be symmetric and the prior probabilities of the two hypotheses to be equal. We use 25 randomly picked samples as kernel centers in both the WKDRF and the KL-divergence fitting methods. We use 200 stumps as weak classifiers in Wald-Boost. In the testing phase, the same samples are fed into all methods until they terminate, and the termination time is recorded. A scatter plot of the data set is given in Figure 1a and the empirical performance result is plotted in Figure 1b. The proposed method outperforms both the KL-divergence based method and Wald-Boost in this example, achieving a lower sampling cost for a given probability of error.
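The testing protocol described above can be sketched as a small Monte Carlo harness. The code below is a simplified stand-in, not the experimental code behind Figure 1: the two samplers are hypothetical unit-variance Gaussians with means 0 and 1 rather than the distributions used in the paper, hypotheses alternate to mimic equal priors, and the function name `evaluate` is an assumption.

```python
import math
import random

def evaluate(log_ratio, sampler0, sampler1, a, b, n_trials, seed=0, max_len=10000):
    """Return (average stopping time, error rate) of a thresholded accumulator."""
    rng = random.Random(seed)
    errors, total_len = 0, 0
    for t in range(n_trials):
        label = t % 2                       # alternate hypotheses: equal priors
        draw = sampler1 if label else sampler0
        s, n = 0.0, 0
        while a < s < b and n < max_len:    # keep sampling between the boundaries
            s += log_ratio(draw(rng))
            n += 1
        declared = 1 if s >= b else 0
        errors += declared != label
        total_len += n
    return total_len / n_trials, errors / n_trials
```

Running the harness with the exact log-likelihood ratio `x - 0.5` of the toy pair and with a deliberately mis-scaled version `0.5 * (x - 0.5)` at the same thresholds shows the kind of trade-off plotted in the performance curves: the mis-scaled estimate needs noticeably more samples before it stops.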

The second example we present is human activity recognition. The data set we used is the smartphone recorded human activity data from the UC Irvine Machine Learning Repository website [14]. Two feature sets are used in our evaluation: Features 1-3, which are the mean accelerometer values, and Features 294-296, which are the mean frequency domain accelerometer values. We consider two classification tasks: 1) determining whether a subject is moving or static, and 2) classifying subjects that are moving on a staircase as walking upstairs or downstairs. The sizes of the training data are 3285 and 4067 respectively for the moving v.s. static test, and 1073 and 986 for the up v.s. down test. We picked 50 randomly chosen samples as kernel centers, and the number of stumps used in Wald-Boost is 200. The data set is plotted in Figure 2a-d, and the results are in Figure 2e-h. Again, in both classification tasks our method outperforms the other methods. Notably, Wald-Boost outperforms the KL-divergence based method in this learning task.

We note that, if the true density ratio and its inverse are indeed in the RKHS function class, then KL-divergence density ratio fitting would result in log-density ratio estimates identical to those of our proposed method. Under model mismatch for the log-density ratio function, our optimization criterion balances the two errors in the SPRT expression to choose the density estimate, arguably resulting in better performance in sequential tasks.


5 Conclusion 

In this work, we proposed a method for learning binary sequential tests based on optimizing a variational bound on the sampling cost of SPRT. The proposed algorithm results in a convex program that can be solved efficiently. Experimental results show that the proposed algorithm outperforms previously proposed techniques, achieving a smaller stopping time for a given error rate. A potential direction for future work is to characterize the distance metric between the true and estimated log-density ratio functions when the proposed optimization criterion is utilized, and to use this metric to study the convergence of the proposed method as the number of training samples increases.

Figure 2: Human activity classification. (c) Walking upstairs v.s. downstairs, training features 1-3. (d) Walking upstairs v.s. downstairs, training features 294-296. (e) Performance curve, moving v.s. static using features 1-3. (f) Performance curve, moving v.s. static using features 294-296. (g) Performance curve, walking upstairs v.s. downstairs using features 1-3. (h) Performance curve, walking upstairs v.s. downstairs using features 294-296.